Maximum Entropy based Natural Language Interface for Relational Database
Deepthi S, Rejimoan R and Vinod Chandra S S

Abstract
A Natural Language Interface for DataBase (NLIDB) is a system which accepts the user request in natural language and converts it to an SQL query. The system consists of two parts: a Language Processor (LP) and a Query Translator (QT). The LP is used to extract information from the user query. The main LP techniques used in our system are Part-of-Speech (POS) Tagging and Chunking, both implemented with Maximum Entropy models. The QT is used to formulate SQL queries. It has predefined query templates which are selected based on the constraints, connectors etc. specified in a user query. Finally, the SQL query is obtained by completing the selected query template with the already identified details such as attributes and conditions.

Keywords: NLIDB, POS Tagging, Maximum Entropy, QT.

INTRODUCTION
Access to data stored in databases has always been a problem for regular users, who are commonly unaware of query languages. Research in this area has been going on since the late 1960s. It has aimed at building a Natural Language Interface for Database (NLIDB) so that users can query the database directly without knowledge of a query language. An NLIDB lets users query the database in formal English and translates the request into a proper SQL query. Early NLIDB systems played several roles in interface-based query processing. LUNAR was developed as an interface to the database that held information about rocks collected during American moon expeditions [1], and LADDER is a semantic grammar based interface to the US Navy ships database [2]. CHAT-80 interfaced with a database of world geography facts [2]. All these systems are either domain oriented or developed to serve a single database. Some newer NLIDB systems are the Semantic Grammar based System [3], Synchronous Context Free Grammar (SCFG) based System [4], PCFG based System [5], WordNet based System [6], Conversation-based System [7] etc.
All these systems accept natural language queries, but they require some prerequisites or demand user support. For example, the semantic grammar based system requires a set of grammars to be defined, the WordNet based system requires an ontology, and the Conversation-based System requires the user to communicate with the system until enough information for query formulation has been collected. These systems fail to formulate the query accurately and hence return incorrect results [8]. To overcome this flaw, an NLIDB system is implemented using a semiparser and a Maximum Entropy machine learning model, which makes predictions based only on known facts. This method increases the accuracy of the queries formed and avoids building multiple trees or grammars for the same request. The NLIDB system described in this paper has two parts: the Language Processor (LP) and the Query Translator (QT). Figure 1 shows the NLIDB architecture. The Language Processor is used to identify constraints, predicates etc. The main components of the LP are the Tokenizer, POS Tagger, Name Identifier and Chunker. The Query Translator holds several query templates and is responsible for formulating the correct SQL query that matches the user request.

Figure 1. NLIDB Architecture

LANGUAGE PROCESSOR
The Language Processor (LP) is used to analyze the user's query, given as English text. The text is first tokenized (sentences are identified and then each sentence is split into words) and then passed to the POS Tagger. The tagger identifies the linguistic category of every word, which helps to understand the request. The chunker ensures that no details of the given request are misinterpreted. The request and all gathered information are passed to the Query Translator for query formulation.

TOKENIZATION
We split a given text into sentences; the end of a sentence is identified by the presence of a full stop, a question mark or an exclamation mark.
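The sentence-boundary rule stated above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `split_sentences` is hypothetical, a boundary is taken to be one of the three terminators followed by whitespace, and abbreviations are deliberately not handled.

```python
import re

# Assumption: a sentence boundary is '.', '?' or '!' followed by whitespace.
# Abbreviations are NOT treated specially in this naive sketch.
_BOUNDARY = re.compile(r"(?<=[.?!])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text after every terminator that is followed by whitespace."""
    return [part for part in _BOUNDARY.split(text.strip()) if part]


if __name__ == "__main__":
    for sentence in split_sentences("List all students. Who joined in 2013?"):
        print(sentence)
```

Because the rule looks only at the terminator, an input such as "Dr. Smith joined." is cut after "Dr.", which is why a full stop needs more care than the other two terminators.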
If the end of a sentence is a full stop, then abbreviations need to be processed separately. Algorithm 1 describes the abbreviation check on a text.

Algorithm 1:
Begin
  Let the current End-of-Sentence condition (.) be p.
  Identify the current token, nc, and the next token, nt.
  If nc is an abbreviation:
    If nt is an end-of-sentence condition, then nt indicates the end of the current sentence. Perform sentence split.
    Else nt indicates the next token of the current sentence.
  Else if nc is not an abbreviation:
    If the suffix is whitespace, then nt indicates the end of the current sentence. Perform sentence split.
    Else if the suffix is a character, then p is part of the token.
End

The sentences are tokenized one at a time. Generally, tokens are identified by the white space between words. Special symbols such as the full stop, question mark, exclamation mark, backslash, double quotes, single quotes, comma, and opening and closing braces are also considered tokens. These tokens are grouped to fix a part-of-speech (POS) tag for a word. For example, consider the user query "List all details of students who joined on 12/3/2013". The equivalent tokens are:

List | all | details | of | students | who | joined | on | 12/3/2013

POS TAGGING
Part-of-Speech Tagging marks each word with its corresponding lexical category (verb, noun, adjective, adverb etc.). The tagger makes use of the Penn Treebank POS tag set developed by the University of Pennsylvania for NLP related research work [9]. It is a standard tag set accepted around the world. The tagger uses a maximum entropy model trained over the Penn tag set and uses WordNet as the underlying dictionary [10]. Maximum entropy based learning is used for predicting the tags of words [11]. Each feature has the format (a, b), where 'a' is the possible tag and 'b' is the current word together with the previous two tags. The tag prediction for each word is made by considering the history 'h' (the sequence of tags assigned to all previous words of the sentence).
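To make the (a, b) format concrete, the sketch below builds one pair per word from an already tagged sentence. It is a minimal illustration, not the paper's feature set: the names `Context`, `START` and `training_pairs` are hypothetical, the `<S>` padding tag for the first two positions is an assumption, and the Penn tags in the usage example are assigned by hand for illustration.

```python
from typing import NamedTuple


class Context(NamedTuple):
    """The 'b' part of a pair: current word plus the previous two tags."""
    word: str
    prev_tag: str
    prev_prev_tag: str


START = "<S>"  # hypothetical padding tag for positions before the sentence start


def training_pairs(words: list[str], tags: list[str]) -> list[tuple[str, Context]]:
    """Return one (a, b) pair per word: a is the tag, b is its context."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    pairs = []
    for i, (word, tag) in enumerate(zip(words, tags)):
        prev_tag = tags[i - 1] if i >= 1 else START
        prev_prev_tag = tags[i - 2] if i >= 2 else START
        pairs.append((tag, Context(word, prev_tag, prev_prev_tag)))
    return pairs


if __name__ == "__main__":
    # Hand-assigned, illustrative tags for the first words of the example query.
    for a, b in training_pairs(["List", "all", "details"], ["VB", "DT", "NNS"]):
        print(a, b)
```

At prediction time the previous tags come from the history of tags already assigned, so a tagger would build each context incrementally rather than from gold tags as done here.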
Each pair (a, b) has a probability p(a, b), and a tag is selected for 'a' such that the entropy H(p) is maximized [11]. H(p) is computed using Shannon's entropy equation [12].
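For reference, Shannon's entropy over the pairs (a, b) takes the standard form below. The excerpt does not write the equation out, so the notation is an assumption: A denotes the set of possible tags, B the set of contexts, and P the set of distributions consistent with the facts observed in the training data.

```latex
H(p) = -\sum_{a \in A,\; b \in B} p(a, b) \, \log p(a, b)
\qquad
p^{*} = \operatorname*{arg\,max}_{p \in P} H(p)
```

Read this way, maximizing H(p) over P picks the most uniform distribution that still agrees with what was observed, which matches the description of a model that predicts only from known facts.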